(57) Abstract: The present invention may provide a method and system for distributed regression modeling. An initial or source
data set is in a partitioned form, i.e. in one or more subsets, and data and regression modules are developed separately for these data
subsets. The subsets can consist of either complete data vectors or incomplete data vectors, or of a mixture of the two. The data
vectors may be incomplete, for example, since they contain missing vector components.
METHODS AND SYSTEMS FOR REGRESSION ANALYSIS OF
MULTIDIMENSIONAL DATA SETS 



The present invention relates to systems and methods suitable for regression analysis
of multi-dimensional data sets. The output of a regression model may be used for
prediction, decision making, market analysis, fraud detection, etc. The present
invention relates to methods and systems, especially distributed processing systems, as
well as computer program products for regression analysis.

Technical Background

In non-parametric regression analysis, an unknown mathematical function $f$ is
modeled given a finite number of possibly inaccurate covariates or data points, without
requiring any a priori assumptions other than perhaps a certain degree of smoothness.
Hence, contrary to standard curve fitting procedures, no regression function parameters
are optimized. The unknown function can be a scalar or a vector function. In the scalar
(univariate) case, a function $y = f(\mathbf{x})$, with $\mathbf{x} \in V \subseteq \mathbb{R}^{d}$, needs to be estimated from a
given set of $M$ possibly noisy input samples:



$$y^{\mu} = f(\mathbf{x}^{\mu}) + \text{noise}, \qquad \mu = 1, \ldots, M, \tag{1}$$



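where the noise contribution has zero mean and is independent of the input samples
$\mathbf{x}^{\mu}$. In the vector (multivariate) case, the function $\mathbf{y} = \mathbf{f}(\mathbf{x})$, with $\mathbf{y} \in \mathbb{R}^{d_y}$, is estimated as
follows:

$$\mathbf{y}^{\mu} = \mathbf{f}(\mathbf{x}^{\mu}) + \text{noise}, \qquad \mu = 1, \ldots, M. \tag{2}$$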



The vector case is often treated as the extension of the scalar case by developing $d_y$
scalar regression models independently, one for each vector component.

The regression performance improves if the number of "knots" in the model is
not fixed but allowed to depend dynamically on the data points: indeed, "dynamic"
knot allocation is known to yield a better regression performance than its static
counterpart (Friedman and Silverman, 1989). Note that the "knots" are the points in $V$
space that join piecewise smooth functions, such as splines, which act as interpolating
functions for generating values at intermediate positions. Alternatively, kernels are
centered at these points (kernel-based regression).

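Traditionally, kernel-based regression proceeds as follows. At each training
sample a kernel, such as a circular-symmetrical Gaussian, can be localized with a
height equal to the corresponding desired output value $y^{\mu}$. All Gaussians have the same
radius (standard deviation) $\sigma$ which, in fact, acts as a smoothing parameter. Formally, the
output of the regression model, for a given input $\mathbf{x}$, can be written as:

$$O(\mathbf{x}) = \sum_{\mu=1}^{M} y^{\mu} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^{\mu}\|^{2}}{2\sigma^{2}}\right) \tag{3}$$

in the univariate (scalar) case, and

$$O_i(\mathbf{x}) = \sum_{\mu=1}^{M} y_i^{\mu} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^{\mu}\|^{2}}{2\sigma^{2}}\right), \qquad i = 1, \ldots, d_y, \tag{4}$$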



in the multivariate (vectorial) case. In principle, $\sigma$ is directly related to the noise
distribution of the input signal; however, such a measure is not always available. On the
other hand, the radius can always be determined by cross-validation. The approach just
described is adopted in the general regression neural network (GRNN) (Specht, 1991).
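The placement of one Gaussian per training sample described above can be sketched as follows. This is only an illustrative sketch: the function names, the sample values and the `normalize` option are assumptions introduced here. The unnormalized sum follows the description in the text, while the optional division by the total kernel activation corresponds to the normalized form used in Specht's GRNN.

```python
import math

def gaussian(x, center, sigma):
    """Circular-symmetric Gaussian of radius sigma centered at `center`."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_regression(x, samples, targets, sigma, normalize=False):
    """Sum of Gaussians placed at the training samples, scaled by the targets.

    With normalize=True the sum is divided by the total kernel activation
    (hypothetical option; assumes x is not so far away that all kernels vanish).
    """
    acts = [gaussian(x, s, sigma) for s in samples]
    total = sum(t * a for t, a in zip(targets, acts))
    if normalize:
        return total / sum(acts)
    return total

# Illustrative one-dimensional training set (made-up values).
samples = [(0.0,), (1.0,), (2.0,)]
targets = [0.0, 1.0, 4.0]
print(kernel_regression((1.0,), samples, targets, sigma=0.1))
print(kernel_regression((1.5,), samples, targets, sigma=0.5, normalize=True))
```

A small radius reproduces the training targets almost exactly at the sample positions, while a larger radius smooths between them, which is why the radius acts as a smoothing parameter.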
Alternatively, localisation can be restricted to only a limited number of kernels,
say, $N$ circular-symmetrical Gaussians. The kernels are normalized so that
interpolation between the kernel centers can be carried out, e.g., using the normalized
Gaussian:



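$$g_i(\mathbf{x}, \sigma_i) = \frac{\exp\left(-\dfrac{\|\mathbf{x} - \mathbf{w}_i\|^{2}}{2\sigma_i^{2}}\right)}{\sum_{j=1}^{N} \exp\left(-\dfrac{\|\mathbf{x} - \mathbf{w}_j\|^{2}}{2\sigma_j^{2}}\right)} \tag{5}$$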



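The output of the regression model for a given input $\mathbf{x}$ is then defined as:

$$O(\mathbf{x}) = \sum_{i=1}^{N} W_i \, g_i(\mathbf{x}, \sigma_i) \tag{6}$$

in the univariate (scalar) case, and

$$O_j(\mathbf{x}) = \sum_{i=1}^{N} W_{ij} \, g_i(\mathbf{x}, \sigma_i), \qquad j = 1, \ldots, d_y, \tag{7}$$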

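in the multivariate (vectorial) case. The $W_i$ and $W_{ij}$ terms are additional parameters of
the respective regression models. The output of the univariate regression model is
intended to approximate the desired input/output mapping:

$$O(\mathbf{x}^{\mu}) = \hat{y}^{\mu} \approx y^{\mu} = f(\mathbf{x}^{\mu}), \qquad \forall \mu. \tag{8}$$

Similarly, the output vector of the multivariate regression model can be written as:

$$\mathbf{O}(\mathbf{x}^{\mu}) = \hat{\mathbf{y}}^{\mu} \approx \mathbf{y}^{\mu} = \mathbf{f}(\mathbf{x}^{\mu}), \qquad \forall \mu, \tag{9}$$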



with $\mathbf{O} = [O_j]$. The positions (centers) $\mathbf{w}_i$ of the Gaussians are chosen in such a manner
that the regression error, i.e., the discrepancy between the output of the model and the
desired output, for a given set of input/output pairs is minimal; the radii are chosen in
an ad hoc manner. This is basically the Radial Basis Function (RBF) network approach
introduced by Moody and Darken (1988). One could also extend this minimization of
the regression error by including a common kernel radius $\sigma$, or even by including
variable radii in this process (Poggio and Girosi, 1990).
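As a rough illustration of the normalized-kernel model described above, the following sketch computes the normalized Gaussian activations and the weighted output for the univariate case. The function names and the numerical values are assumptions for illustration only; the centers, radii and weights are taken as given rather than learned.

```python
import math

def normalized_gaussians(x, centers, radii):
    """Return the N normalized Gaussian activations g_i(x); they sum to 1."""
    raw = []
    for w, sigma in zip(centers, radii):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, w))
        raw.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(raw)
    return [r / total for r in raw]

def rbf_output(x, centers, radii, weights):
    """Univariate model output: weighted sum of the normalized activations."""
    g = normalized_gaussians(x, centers, radii)
    return sum(W * gi for W, gi in zip(weights, g))

# Two kernels with equal radii (made-up values).
centers = [(0.0,), (1.0,)]
radii = [0.5, 0.5]
weights = [2.0, 4.0]
print(rbf_output((0.5,), centers, radii, weights))
```

Because the activations are normalized, an input halfway between two identical kernels yields the average of their weights, so the model interpolates between the kernel centers.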

In summary, the parameters of the kernel-based regression model, i.e., the
regression weights $W_i$ and $W_{ij}$, kernel centers $\mathbf{w}_i$ and possibly also the common kernel
radius $\sigma$ or the variable radii $\sigma_i$, are optimized by minimizing the regression error. This
is usually done by a learning algorithm which iteratively adjusts these parameters until
the regression error becomes minimal (e.g., for a separate test set of input/output pairs).
This approach, however, provides optimization of the weights for a specific regression
application. For each new application the regression analysis has to be carried out on
the source data.
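The iterative adjustment mentioned above can be illustrated with a minimal sketch that updates only the regression weights with the delta rule (the rule named in the description of Figure 7), keeping the kernel centers and radii fixed. All names, the learning rate, the number of epochs and the toy data are assumptions; for brevity the error is measured on the training samples rather than on a separate test set.

```python
import math

def activations(x, centers, radii):
    """Normalized Gaussian kernel outputs g_i(x) for fixed centers and radii."""
    raw = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, w)) / (2.0 * s ** 2))
           for w, s in zip(centers, radii)]
    total = sum(raw)
    return [r / total for r in raw]

def predict(x, centers, radii, weights):
    """Model output: weighted sum of the normalized kernel outputs."""
    return sum(W * g for W, g in zip(weights, activations(x, centers, radii)))

def train_delta_rule(samples, targets, centers, radii, rate=0.5, epochs=200):
    """Adjust only the weights; each step moves them along error * g_i."""
    weights = [0.0] * len(centers)
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            g = activations(x, centers, radii)
            error = y - sum(W * gi for W, gi in zip(weights, g))
            weights = [W + rate * error * gi for W, gi in zip(weights, g)]
    return weights

def mean_squared_error(samples, targets, centers, radii, weights):
    errors = [(y - predict(x, centers, radii, weights)) ** 2
              for x, y in zip(samples, targets)]
    return sum(errors) / len(errors)

# Toy problem (made-up): regress y = 2x on [0, 1] with three fixed kernels.
samples = [(i / 10.0,) for i in range(11)]
targets = [2.0 * x[0] for x in samples]
centers = [(0.0,), (0.5,), (1.0,)]
radii = [0.3, 0.3, 0.3]
weights = train_delta_rule(samples, targets, centers, radii)
print(mean_squared_error(samples, targets, centers, radii, weights))
```

Note that the weights obtained this way are tied to the particular target function, which illustrates why a purely application-specific optimization has to be repeated for every new regression task.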

It is an object of the present invention to overcome some of the problems of the
known systems, in particular to provide a more flexible method and system of
preparing data for regression analysis.

It is also an object of the present invention to reduce the amount of
communication traffic when running regression analysis on distributed data.



Summary of the present invention

The present invention may provide a method and system for distributed
regression modeling. An initial or source data set is in a partitioned form, i.e. in one or
more subsets, and data and regression modules are developed separately for these data
subsets. The subsets can consist of either complete data vectors or incomplete data
vectors, or of a mixture of the two. The data vectors may be incomplete, for example,
since they contain missing vector components. This could be due to a partitioning into
subspaces. The ensemble of the data modules forms the data model and, similarly, the
regression modules form the regression model. An advantage of the present invention
is the minimization of the need for communication between the data and regression
modules during their development as well as their subsequent use: in this way, delays
due to communication are minimized and, since only model parameters need to be
communicated, issues concerning data ownership and confidentiality are maximally
respected.

The present invention may provide a computer implemented regression method
for carrying out a regression application on a plurality of input data subsets of an input
data signal, comprising the steps of:

developing for each of the input data subsets a data model independently of the
regression application, thus forming a set of data models;

generating a regression model for the input data signal using the set of data
models. The developing step may include a map developing step in which an
unsupervised, self-organizing, kernel-based, topographic map is developed from each
input data subset by maximizing the entropy of the map's output, the topographic map
being executed by a neural network; and during the development step receptive fields
of the topographic map are truncated to receptive field regions, the receptive field
regions being independently scaleable in the space of the input data signal. The map
may be equiprobabilistic. Preferably, at least some of the receptive field regions
overlap. The receptive fields are preferably defined by the same kernel function. The
receptive field regions are preferably hyper-spheroidal. Each data model may comprise
data points and an interpolating function, and the map developing step further
comprises estimating a data density for each data point of a data model, and the
generating step includes the step of selecting a data point for the regression step based
on the data density for that point. The developing step may comprise adapting the
center location and the extent of each receptive field region to the input data density of
the part of the input data signal which activates the kernel thereof. The developing
step may include use of any of the following methods: self-organizing maps (SOM),
topographic maps, constrained topological mapping (CTM), transformation methods
such as principal component analysis (PCA), independent component analysis (ICA),
multidimensional scaling or similar methods, especially unsupervised methods. Each
data record in the input data signal may be contained within one input data subset.
Alternatively, each data record in the input data signal may be distributed over the input
data subsets. Also, some data records in the input data signal may be contained within
one input data subset and some data records may be spread over some of the data
subsets. The step of generating the regression model may comprise generating a
regression sub-model for each of the data models, and combining the regression
sub-models. The generating step may include use of any suitable regression analysis
method such as multilayer perceptrons, (smoothing) splines, support vector machines
(SVM), Radial Basis Function (RBF) networks, general regression neural network
(GRNN), constrained topological mapping (CTM), Generative Topographic Map
(GTM) and the mixture of experts network or similar.

The present invention may comprise a computer based system for executing a
regression application on a plurality of input data subsets of an input data signal,
comprising:

means for developing for each of the input data subsets a data model
independently of the regression application to form a set of data models; and

means for generating a regression model for the input data signal using the set
of data models. The means for developing may include a neural network in which an
unsupervised, self-organizing, kernel-based, topographic map is developed from each
input data subset by maximizing the entropy of the map's output, and means for
truncating receptive fields of the topographic map to receptive field regions, the
receptive field regions being independently scaleable in the space of the input data
signal. The means for generating the regression model may comprise means for
generating a regression sub-model for each of the data models, and means for
combining the regression sub-models. Means for generating a data density model from
each data model may also be provided. Each data model may comprise data points and
an interpolation function, and the system may further comprise decision means for
selecting which data point of each data model is used for the generation of the
regression model based on the output of the means for generating the data density
model for that data point.

The present invention includes a computer program product for causing a
processing engine to execute any of the methods according to the invention. The
present invention also includes a data carrier storing code segments of a computer
program for executing any of the methods according to the present invention.

The present invention may make use of an unsupervised competitive learning
rule for equiprobabilistic topographic map formation, called the kernel-based
Maximum Entropy learning Rule (kMER) since it maximizes the information-theoretic
entropy of the map's output, as well as systems, especially distributed processing
systems, for carrying out the rule. Since kMER adapts not only the neuron weights but
also the radii of the kernels centered at these weights, and since these radii are updated
so that they model the local input density at convergence, these radii can be used
directly in variable kernel density estimation. The data density function at any neuron
is assumed to be convex, and a cluster of related data comprises one or more neurons.
The data density function may have a single radius, e.g. a hypersphere.
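A minimal sketch of variable kernel density estimation is given below, assuming equal-weight Gaussian kernels, one per neuron, centered at the neuron weights and using the learned radii as standard deviations. The function name `variable_kernel_density`, the Gaussian form and the example values are assumptions for illustration; the text above does not fix the exact estimator.

```python
import math

def variable_kernel_density(x, centers, radii):
    """Average of N normalized Gaussians, each with its own radius."""
    d = len(x)
    total = 0.0
    for w, sigma in zip(centers, radii):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, w))
        norm = (2.0 * math.pi * sigma ** 2) ** (d / 2.0)
        total += math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm
    return total / len(centers)

# Two neurons in one dimension (made-up values): a narrow and a wide kernel.
centers = [(0.0,), (3.0,)]
radii = [0.5, 1.0]
print(variable_kernel_density((0.0,), centers, radii))
print(variable_kernel_density((3.0,), centers, radii))
```

Giving every kernel the same weight 1/N matches the equiprobabilistic setting, in which each neuron accounts for the same share of the data; a smaller radius then corresponds to a locally higher density.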

The present invention may make use of a processing engine and a method for
developing a kernel-based topographic map which is then used in data model-based
applications. The receptive field of each kernel need not be disjoint from the others, i.e.
the receptive fields may overlap. The engine may include a tool for self-organizing and
unsupervised learning, a monitoring tool for maximising the degree of topology
achieved, for example, by using the overlapping kernels, and a tool for automatically
adjusting the kernel widths to achieve equiprobabilism. The receptive fields of the
kernels may be convex, e.g. hyperspheroids or hyperspheres, but the present invention
is not limited thereto.

The engines and methods described above may be local or distributed. The data 
model and the processing on the data model may be local or distributed. 

By virtue of the topographic map learning rule, called the kernel-based
Maximum Entropy learning Rule (kMER), all neurons of a neural network have an
equal probability to be active (equiprobabilistic map) and, in addition, pilot density
estimates can be readily obtained.

The present invention will now be described with reference to the following
drawings.

Brief Description of the Drawings 

Figure 1 shows a distributed network with which the present invention can be
used.

Figure 2 shows a further distributed network with which the present invention
can be used.

Figure 3 shows embodiments of the present invention demonstrating vertical
modularity. (A) The data model and the regression model are developed separately and
operate in sequence. (B) Several regression models can be grafted onto a common data
model.

Figure 4 shows kernel-based maximum entropy learning and the data model
which can be used with embodiments of the present invention. (A) Definition of the
type of formal neurons used. Shown are the activation regions $S_i$ and $S_j$ of neurons $i$
and $j$ (large circles) of a lattice $A$ (not shown). The shaded area indicates the
intersection between regions $S_i$ and $S_j$ (overlap). The receptive field centers $\mathbf{w}_i$ and
$\mathbf{w}_j$ are indicated with small open dots. (B) Neuron $i$ has a localized receptive field kernel
$K(\mathbf{x} - \mathbf{w}_i, \sigma_i)$, centered at $\mathbf{w}_i$ in input space. The kernel radius $\sigma_i$ corresponds with
the radius of activation region $S_i$. The present input $\mathbf{x} \in V$ is indicated by the black dot
and falls outside $S_i$.

Figure 5 shows an optimized kMER algorithm, batch version, adapted for the
case where the values of some input vector components are missing (marked as "don't
cares" or x).

Figure 6 shows an algorithm for determining missing input vector
components in accordance with an embodiment of the present invention.

Figure 7 shows a kernel-based regression of a function $y = f(\mathbf{x})$, $\mathbf{x} = [x_1, x_2]$,
using a two-step procedure: the data model and the regression model in accordance
with an embodiment of the present invention. The data model comprises a topographic
map of $N$ neurons of which the activation regions are determined with kMER (locations
$\mathbf{w}_i$ and radii $\sigma_i$). The regression model consists of $N$ Gaussian kernels, with the same
locations and (scaled) radii as the activation regions in the data model (cf. the icons next
to $g_1$ and $g_N$). The outputs of these kernels, $g_1, \ldots, g_N$, are weighted by the parameters
$W_1, \ldots, W_N$ so as to produce the regression model's output $O$. The $W_1, \ldots, W_N$ are trained
with the delta rule in such a manner that the output $O$ represents an estimate $\hat{y}$ of the
function to be regressed, $y = f(\mathbf{x})$ (cf. the icon on the top).

Figure 8 shows embodiments of the present invention having horizontal
modularity. Partitioning of the data set into subsets of the input space data set (A), or
into subspace data sets (B), and a mixture of the two (C).

Figure 9 shows a further embodiment of the present invention having
horizontal modularity. Case of several subsets and one regression module.

Figure 10 shows an embodiment of the present invention having multiple
subsets and multiple regression modules. The latter are arranged at two levels.

Figure 11 shows an embodiment of the present invention having multiple
subsets and regression modules and a single decision module.

Figure 12 shows an embodiment of the present invention having multiple
subspaces and multiple regression modules at two levels.

Figure 13 shows an embodiment of the present invention involving a mixed
case, heuristic strategy: multiple subsets of complete and incomplete input vectors. The
presence of incomplete data vectors is ignored.

Figure 14 shows a further embodiment of the present invention involving a
mixed case, correct strategy: the presence of complete data vectors is ignored.

Definitions

Data model: a lossy representation of the main characteristics of a data set. The
representations may be ones that reflect parameters of the density distribution, or
descriptors of its shape and extent, or the parameters of a projection of the data onto a
subspace, or a transformation of the data such that a certain criterion is satisfied. The
term "lossy" implies that the original data set is not contained within the data model
and that the data model is a compacted representation in comparison with the source
data set.

Equiprobabilistic: All units (neurons) in the lattice (neural network) have an equal
probability to be active. All representational units should participate with the same
probability in the representation. All kernels contribute equally in the density estimate
provided by the topographic map (prior class probabilities are equal, if each kernel would
represent a different class).

Entropy: entropy of a variable is the average amount of information obtained by
observing the values adopted by that variable. It may also be called the uncertainty of
the variable.

Kernel-based: a particular type of receptive field - it has the shape of a local function,
e.g. a function that adopts its maximal value at a certain point in the space in which it
is developed, and gradually decreases with distance away from the maximum point.

Topographic map: a mapping between one space and another in which neighboring
positions in the former space are mapped onto neighboring positions in the latter space.
Also called topology-preserving or neighborhood-preserving mapping. If there is a
mismatch in the dimensionality between the two spaces there is no exact definition of
topology preservation and the definition is then restricted to the case where
neighboring positions in the latter space code for neighboring positions in the former
space (but not necessarily vice versa).

Receptive field: the area or region or domain in the input space within which a neuron
(generally synonymous with synaptic element of a neural network) can be stimulated.

Self-organizing: refers to the genesis of globally ordered data structures out of local 
interactions. 
Non-parametric density estimation: no prior knowledge is assumed about the nature or
shape of the input data density. Non-parametric regression: no prior knowledge is
assumed about the nature or shape of the function to be regressed.
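The link between the definitions of "entropy" and "equiprobabilistic" given above can be made explicit under the assumption that the entropy in question is the discrete Shannon entropy over the $N$ units of the lattice, where $p_i$ (a symbol introduced here for illustration) denotes the probability that unit $i$ is active:

```latex
H = -\sum_{i=1}^{N} p_i \log p_i \;\le\; \log N,
\qquad \text{with equality if and only if } p_i = \tfrac{1}{N} \text{ for all } i .
```

In this reading, a map that maximizes the entropy of its output is one in which all units are active with equal probability.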

Description of the illustrative embodiments

The present invention will be described with reference to certain embodiments
and with reference to certain examples and to certain drawings but the present
invention is not limited thereto but only by the claims. In particular the present
invention will be described with reference to generating a data model using an
equiprobabilistic, unsupervised topographic map but the present invention is not
limited thereto. Other methods such as self-organizing maps (SOM), topographic
maps, constrained topological mapping (CTM), transformation methods such as
principal component analysis (PCA), independent component analysis (ICA),
multidimensional scaling or similar methods, especially unsupervised methods,
may be used.

1. Distributed processing systems

Fig. 1 shows a basic network 10, parts or the whole of which may be used with
embodiments of the present invention. It comprises two major subsystems which may
be called data processing nodes 2 and data intelligence nodes 3, 4. A data processing
node 2 comprises one or more microprocessors that have access to various amounts of
data which may include very large amounts of data. The microprocessor(s) may be
comprised in any suitable processing engine which has access to peripheral devices
such as memory disks or tapes, printers, modems, visual display units or other
processors. Such a processing engine may be a workstation, a personal computer, a
main frame computer, for example, or a program running on such devices, e.g. a UNIX
workstation or a personal computer with a Pentium III processor from Intel, USA and
running the Windows 98 operating system from Microsoft, USA. The data may be
available locally in a data store 1 or may be accessible over a network connection such
as via the Internet, a company Intranet, a microwave link, a LAN or WAN, etc. The
data may be structured such as is usually available in a data warehouse or may be "as
is", that is unstructured, provided it is stored in an electronically accessible form, e.g.
stored on hard discs in a digital or analogue format. The processing engine is provided
with software programs for running an algorithm in accordance with the present
invention to develop a data model which is independent of the regression application to
be applied. The data model is preferably based on a topographic representation of the
input data but the present invention is not limited thereto. Particularly preferred is a
competitive learning algorithm capable of producing a topographic equiprobabilistic
density representation (or map) of the input data having a linear mapping of real input
data densities to the density represented by the topographic representation (the
algorithm is described in more detail below). This topographic map may be described
as a graph 7 of data models that represent the data processed. Each data model may be
described as a compacted representation of the source data and comprises data points
and a real function. The combination of the data points and the real function provides at
least an approximate model of the source data. Each data model is stored on a suitable
storage medium and is a data structure in accordance with the present invention. The
node 2 can use a persistent medium to store the graph of data models 7, or it can send it
directly to other nodes in the network, e.g. nodes 3 and 4. Optionally, the data
processing node 2 can be distributed over a network, for example a LAN or WAN. A
data processing node 2 can run on most general purpose computer platforms,
containing one or more processors, memory and (optional) physical storage. The
supported computer platforms include (but are not limited to) PCs, UNIX servers and
mainframes. Data processing nodes 2 can be interconnected with most existing and
future communication links, including LAN and WAN systems. A data processing
node 2 can also have a persistent storage system capable of storing all the information
(data) that has been used to generate the graph 7 of data models that are stored or
maintained by this node, or a sub-graph of the graph 7. The graph 7 of data models can
be saved regularly to the persistent medium for the analysis and monitoring of changes
and evolutions in the data (e.g. trend detection).

A data processing node 2 can also run other or additional algorithms that can be
used to process data or pre-process or prepare data ready for analysis (e.g. to capture
time dynamics in data sets). The data sets that are used can be either structured data (e.g.
databases) or unstructured data (e.g. music samples, samples of pictures). The only
limitation is that the data should be offered in a format that can be processed by the
chosen computer platform as described above.



A data processing node 2 can provide a data intelligence node 3, 4 with an
individual data model, a sub-graph or the complete graph 7 of data models. Normally,
only the data models are returned, not the data itself. However, it is (optionally)
possible to return also the data that is used to generate the data models.

The graph of data models that is built up and maintained by a data processing
node 2 contains:

1) A number of datanodes that contain the data models that describe at least a part of 
the data set. 

2) A number of directed links between the datanodes. 



Note: datanodes should not be confused with nodes of a distributed system such as 2,
3, 4 of Fig. 1. The word "datanode" refers to a data model which models at least a
portion of the input data to be processed. A datanode may be a software node. A
datanode may be a neural network which models a part of a topographic
equiprobabilistic density map, for example, a part generated by clustering after
application of the novel competitive learning algorithm described above.

The datanodes and directed links are preferably organized in a hierarchical 
system, that is in a tree, in which topographic maps from two or more levels overlap. 

There are a few special types of datanodes: 

1) The top level datanode contains the data model of the complete data set, and
all the other datanodes in the tree can be reached from this top level datanode via
the directed links. It has only originating directed links - it has no terminating single
directed link from another datanode.

2) A leaf datanode is a datanode that has no originating, single directed links to other
datanodes. In a system that has gone through the complete initial training phase in
accordance with the competitive learning algorithm in accordance with the present
invention, this means that the data in this datanode has a substantially homogeneous
distribution and that it is not relevant to split the data further. It can be said that a leaf
node cannot be resolved into clusters which can be separated, that is it is
"homogeneous", or any discernible clusters are of such minor data density difference
that this difference lies below a threshold level.

All the other (intermediate) datanodes in the tree between the top datanode and
a leaf datanode describe:



WO 01/80176    PCT/BE01/00065



a) A subset of the complete dataset with common characteristics, but that can be
refined further (i.e. without a uniform distribution).

b) The data model of the data described by this datanode.

During an initial training phase the following steps are taken, which result in the
formation of the datanode tree from the top down:

a) The top-level model is generated (i.e. generation of the top datanode). 

b) This top-level model is divided into several parts in accordance with rules described
in detail below. The division is not a simple tiling of the top level topographic map.

c) Each of these parts describes a subset of the complete dataset and a data model is
generated for this subset, i.e. either intermediate or leaf datanodes or a mixture of the
two is generated.

d) The data described by an intermediate datanode can be divided further into other
intermediate datanodes and/or leaf datanodes. If it is not possible to divide an
intermediate datanode further, this intermediate datanode is a leaf datanode.

After the initial run, the graph produced is a tree, from top-datanode to leaf datanode. 
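The tree of datanodes and directed links described above can be pictured with a small data structure. This is only a sketch: the class name `DataNode`, its fields and the example node names are hypothetical, and the data model itself is left as a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataNode:
    """A datanode: a data model plus directed links to finer datanodes."""
    name: str
    model: Optional[dict] = None                      # placeholder for data-model parameters
    children: List["DataNode"] = field(default_factory=list)  # originating directed links

    def is_leaf(self) -> bool:
        """A leaf datanode has no originating directed links."""
        return not self.children

    def leaves(self) -> List["DataNode"]:
        """Collect all leaf datanodes reachable from this datanode."""
        if self.is_leaf():
            return [self]
        found: List["DataNode"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found

# Top datanode -> one intermediate datanode (with two leaves) and one leaf.
top = DataNode("top", children=[
    DataNode("a", children=[DataNode("a1"), DataNode("a2")]),
    DataNode("b"),
])
print([node.name for node in top.leaves()])
```

Every datanode is reachable from the top datanode by following the directed links, which is what allows a sub-graph to be handed to another processing node as a unit.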

One of the additional advantages of the data processing nodes 2 in accordance
with the present invention is the capability to distribute functions over the network 10.
There are several advantages in distributing the data processing nodes across the
network 10:

a) Higher performance. Although the algorithms in accordance with the present
invention are capable of running on parallel computer systems as they are almost
linearly scaleable, additional computing power may be advantageously provided by
another computer.

b) Privacy: for example, the European privacy rules for personal information limit the
transfer between companies of personal information that is collected. For data mining
purposes, the characteristics of individuals are not usually important, but rather the
higher-level information or attributes of these individuals.

c) Integration or cost reasons. It is sometimes not feasible to install a complete data 
store, e.g. a data warehouse in one location. 
The system in accordance with the present invention can fulfill all these requirements
through a number of different distribution schemes. Several combinations of different
distributed network schemes are possible, of which two are described below with
reference to Fig. 2, which shows a more complex network 20 than that of Fig. 1.

In a system 20 designed for master/slave processing of sub-graphs with a
common data source in accordance with an embodiment of the present invention, new
data or data that is used to retrain the data models is provided to a master data
processing node such as 15. Optionally, this node 15 retrains the main data model (top
datanode in the graph of data models). In addition, it may determine to which
intermediate datanode(s) this data point belongs and send it to the processor(s)
responsible for this (these) intermediate datanode(s). Such an intermediate datanode
can belong to another data processing node, somewhere else in the network 20, called a
slave data processing node 12 - 14.

As the processing requirements increase, it is possible to add additional slave
data processing nodes such as 11 to the network 20. A slave processing node can also
act as a master processing node for another slave processing node depending on the
network configuration, e.g. in Fig. 2, the nodes 11, 12 are slaves of the master data
processing node 13 on top of the data, but this node 13 is also in a slave relationship to
the master node 15.

The data modeling methods of the present invention may be used with parallel
processing. For example, a distributing master processing node 16 collects all the
initial data and carries out top level clustering. It determines a plurality of independent
clusters, e.g. two. It decides to process the first of the clusters and sends the second
cluster with its associated data to the processing node 15 for parallel processing of the
data. Data updates will be received by the distributing master node 16 and the master
node 16 determines if the data is to be processed by itself or by the alternative
processing engine in node 15. To do this, the master node 16 keeps a mapping between
the records and the relevant processing engine. For example, the parallel processing
system may include a plurality of independent processing units each comprising a
processor, local disk storage, and local memory. The independent processing units may
be connected to a high performance switch which provides communications between
the processing units. A controller work station is connected into the system for input of
data and display of results. One such system is IBM Corporation's RISC System/6000
Scalable POWERparallel system.

In an alternative embodiment, the tree structure of the graph 7 of data models
may be at least partly mapped to a hierarchical master/slave network 20 as described
above. A specific embodiment of the use of a master/slave network will now be
described. Initially, the complete graph 7 of data models is set up on every processing
node. After this, the individual processing nodes assume their allotted task of
processing their specific part of the tree. For this purpose they each receive the relevant
data subset from the master processing node. From the other nodes in the system, each
processing node receives the weighting factor updates for the neurons other than the
ones the node processes. With the updated factors introduced into the graph 7, each
node can process new data accurately, with the influence of other intermediate
datanodes and leaf nodes being modeled correctly. As only weighting factors are
shipped around the network and not data, the amount of signal traffic is much reduced.
This aspect of the present invention is a direct consequence of the linear density
mapping of the topographic map generated by the competitive learning algorithm in
accordance with the present invention. Only if the data which is associated with a
cluster or clusters can be separated cleanly and accurately from the total data is it
possible to safely process parts of the graph 7 in a distributed way. If the data density
estimation is only approximate, the labeling of data to be associated with a specific
cluster/neurons is inaccurate. This means that a significant proportion of wrong data is
shipped to each slave node (that is, data which should not be contributing to the
processing of the datanode processed on this processing node but should be
contributing on another node). This falsifies the results at each node. As only the
weighting factors are shipped around the system and these may be sensitive to the
incorrect data, the whole graph 7 may become an incorrect decision making tool within
a short period of time.

In accordance with a further embodiment of the present invention, slave/slave
processing of a complete graph without a common data source but with a centralization
point (master node) is provided. The total graph 7 of data models can be generated by
several data processing nodes in a slave/slave configuration that have individual access
to different (separated) data sets. In one embodiment the different slave data processing
units process their own data and send the updates (revised neural weighting factors)
regularly to a central data processing node, which unifies the different updates into a
single consistent data model. There are two types of distribution of data:

1) Horizontally distributed: the different datasets describe the same characteristics
(i.e. columns in a database table), but different subjects (i.e. different records in this
table).

An example: the customer database of a multinational company will be distributed
over several databases in the different countries, each database describing the same
attributes but only of the local customers.

2) Vertically distributed: the different datasets describe different characteristics of the
same (group of) subjects.

An example: the salary database, the human resources database, the restaurant database,
all for the same employees of one company.
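
The two distribution types can be illustrated with a small sketch. The toy table, its column names and the helper functions `split_horizontally` and `split_vertically` are hypothetical and serve only to show which part of each record a site would hold.

```python
# Hypothetical toy table: one row per subject, one column per characteristic.
table = [
    {"id": 1, "country": "BE", "salary": 3000, "meals": 12},
    {"id": 2, "country": "BE", "salary": 3500, "meals": 4},
    {"id": 3, "country": "FR", "salary": 3200, "meals": 9},
]


def split_horizontally(rows, key):
    """Same columns at every site, different records per site."""
    sites = {}
    for row in rows:
        sites.setdefault(row[key], []).append(dict(row))
    return sites


def split_vertically(rows, column_groups, id_column="id"):
    """Same subjects at every site, different columns per site."""
    sites = {}
    for name, columns in column_groups.items():
        sites[name] = [
            {id_column: row[id_column], **{c: row[c] for c in columns}}
            for row in rows
        ]
    return sites


horizontal = split_horizontally(table, "country")
vertical = split_vertically(
    table, {"payroll": ["salary"], "restaurant": ["meals"]}
)
```

In the horizontal split each country holds complete records of its own customers; in the vertical split every site holds all subjects but only some of their attributes, linked by a shared identifier.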

A combination of horizontally and vertically distributed data is also possible - a hybrid
system in which a data record may be completely contained within one database or
may be spread over some or all other databases. The ways in which a distributed network 20
as shown in Fig. 2 may be operated with vertically or horizontally distributed data
form separate embodiments of the present invention. With horizontal distribution, each
local processing node processes its own data. If it is necessary to query the whole data
model, it is possible to devise queries which are sent around the network, collect
the answers from each processing node and bring them to the questioning node, e.g. the
master node. Schemes in accordance with the present invention which involve local
processing of distributed data and local generation of the data model may be applied
for cost or time-to-install reasons, i.e. to eliminate the need for collecting together a
centralized data warehouse, or for privacy reasons, if, for instance, the different data
sets belong to different companies. This exemplifies an aspect of the present invention
which is to leave data where it is and to only ship queries, answers or at most
abstracted versions of the data (data models) around the network. For instance, an
alternative embodiment involves generating a data model locally from the local data at
one node and retrieving only the data models from other processing nodes. This would
mean, in the example above, that the data models from various countries would be
brought together at one processing node. This would allow querying of all these
models on a single computing platform. This may seem wasteful in data transfer in
comparison to only sending the query around the network and collecting the answers.
However, there may be advantages in doing so - e.g. the answers and the queries may
be kept more private or the query is of the type that all the data must be present at one
location. In any case, it still requires less data transfer than collecting and updating a data
warehouse.

It is also possible to separate some other data processing functions that are not
really related to the generation and maintenance of the graph of data models. For
example, a data processing node 16 can serve only as a device that collects the graph of
data models from the other data processing nodes, and updates the data analysis nodes
if required.

Data intelligence nodes provide the additional logic (called application
components) needed to run the regression applications such as the following. A data
intelligence node contains a processing engine and may be a workstation, a personal
computer, a main frame computer, for example, or a program running on such devices,
e.g. a UNIX workstation or a personal computer with a Pentium III processor from
Intel, USA and running Windows 98 operating system from Microsoft USA. The data
analysis system in accordance with embodiments of the present invention can be used
in the following non-limiting list of regression applications: direct marketing, quality
assurance, predictions, data analysis, fraud detection, optimizations, behavior analysis,
decision support systems, trend detection and analysis, intelligent data filtering,
intelligent splitting of data sets, data mining, outlier detection. Application components
relate to programs run to analyze the graph 7 of data models or part of it to obtain a
useful, concrete and tangible result. It is an aspect of the present invention that the data
model generation is kept separate from the regression applications. When the
regression application is partly mixed into the data model generation, the data model
becomes restricted in its use. One aspect of the present invention is that the data model
is as neutral as possible, i.e. it is not determined by the specifics of the data nor by the
specifics of the regression application. This allows the data models in accordance with
the present invention to have a longer useful life. New queries and new problems can
be investigated on the data model. Due to the linear mapping of the real data densities
into the topographic map, the data model is an accurate representation of the data
despite being highly compressed in size.

There are basically two different data intelligence nodes (Fig. 2):

1) The data intelligence node 17, 18 for machine-machine interaction is designed to
offer data mining intelligence to other applications.

2) The data intelligence node 4 for exploratory data mining applications, which allows a
human analyst 5 to analyze data interactively.

A data intelligence node 17, 18 for machine-machine interactions is a node containing
one or more processors that has access to at least a sub-graph of the graph 7 of data
models as generated by a data processing node. The node 17, 18 contains the
predefined logic to execute a specific application as needed by another computer
system. For this purpose, this node 17, 18 offers the possibility to answer queries
relating to these specific applications through a number of standardized
machine-to-machine interfaces, such as a database interface (ODBC/JDBC), middle-ware
interfaces (CORBA, COM) and common exchange formats (text files, XML streams,
HTML, SGML).

This node 17, 18 can be connected to a data processing server 16 that serves at
least a sub-graph of the graph 7 of data models using any LAN or WAN connection. A
constant connection is not required: a data intelligence node with a persistent storage
system can store the (graph of) data models locally and an application component can
be made available that can detect if the data model has to be resynchronized. A node
17, 18 can run on the same platform as a data processing node, it can run as a part of
another application or it can run on small handheld devices (small computers, mobile
phones). It is also possible to combine a data processing node and a data intelligence
node on a single (physical) machine.

A data intelligence node 4 for exploratory data mining (machine-man interface)
is a node containing one or more processors that has access to at least a sub-graph of
the graph 7 of data models, analogous to a data intelligence node 17, 18 for
machine-machine interface, but this node 4 also requires a visualization device that allows the
user to browse in the graph 7 of data models and the results provided by the data
models. The user can analyze the data and run an application component for selected
data (e.g. any of the applications described above). Specific application components can
help the analyst to detect specific behavior patterns.

It is possible to run most of the regression applications in the data intelligence
nodes in an interactive mode.

In accordance with an embodiment of the present invention a plurality of
multidimensional source databases containing data to be processed are first prepared (if
this is necessary) and a data model is determined for each one, e.g. using a set of
unsupervised competitive learning rules (kMER). Optionally, a non-parametric
density model is built up from the neuron weights and the radii obtained by
kMER for each data model.

The present invention may be implemented on a computing system comprising
computing devices, e.g. personal computers or work stations, which have an input
device for loading data. The computing devices are adapted to run software which
carries out any of the methods in accordance with the present invention. The computer
may be a server which is connected to a data communications transmission means such
as the Internet, a Local Area Network or a Wide Area Network. A script file including
the details of the data subsets of the data input signal may be sent from one or more
near locations, e.g. terminals, to a remote, i.e. second location, at which the server
resides. The server receives this data and outputs back along the communications line
useful data to the near terminal, e.g. a result of the regression analysis.

Typically a computer has a video display terminal, a data input means such as a
keyboard, and a graphic user interface indicating means or pointer such as a mouse. A
computer 10 includes a Central Processing Unit ("CPU"), such as a conventional
microprocessor of which a Pentium III processor supplied by Intel Corp. USA is only
an example, and a number of other units interconnected via a system bus. The computer
includes at least one memory. Memory may include any of a variety of data storage
devices known to the skilled person such as random-access memory ("RAM"),
read-only memory ("ROM"), non-volatile read/write memory such as a hard disc as known
to the skilled person. For example, the computer may further include random-access
memory ("RAM"), read-only memory ("ROM"), as well as an optional display adapter
for connecting the system bus to an optional video display terminal, and an optional
input/output (I/O) adapter for connecting peripheral devices (e.g., disk and tape drives)
to the system bus. The video display terminal can be the visual output of the computer, which
can be any suitable display device such as a CRT-based video display well-known in
the art of computer hardware. However, with a portable or notebook-based computer,
the video display terminal can be replaced with an LCD-based or a gas plasma-based
flat-panel display. The computer further includes a user interface adapter for connecting a
keyboard, mouse, optional speaker, as well as allowing optional physical value inputs
from physical value capture devices of an external system which input the data signal.
The devices may be any suitable equipment for capturing physical parameters.
A computer also includes a graphical user interface that resides within a
machine-readable media to direct the operation of the computer. Any suitable
machine-readable media may retain the graphical user interface, such as a random access
memory (RAM), a read-only memory (ROM), a magnetic diskette, magnetic tape, or
optical disk (the last three being located in disk and tape drives). Any suitable
operating system and associated graphical user interface (e.g., Microsoft Windows)
may direct the CPU. In addition, the computer includes a control program which
resides within computer memory storage. The control program contains instructions
that when executed on the CPU carry out the operations described with respect to the
methods of the present invention. The instructions may be obtained by writing a
computer program in a suitable language such as C or C++ for execution of any of the
methods in accordance with the present invention and then compiling the program so
that it executes on a computer.

Those skilled in the art will appreciate that the hardware may vary for specific
applications. For example, other peripheral devices such as optical disk media, audio
adapters, or chip programming devices, such as PAL or EPROM programming devices
well-known in the art of computer hardware, and the like may be utilized in addition to
or in place of the hardware already described.

The present invention also includes a computer program product (i.e. a control
program for executing methods in accordance with the present invention comprising
instruction means in accordance with the present invention) which can reside in computer
storage. The instructions (e.g., computer readable code segments in storage) may be
read from storage into the RAM. Execution of sequences of instructions contained in
the RAM 24 causes the CPU to perform the process steps described herein. In
alternative embodiments, hard-wired circuitry may be used in place of or in
combination with software instructions to implement the invention. Thus,
embodiments of the invention are not limited to any specific combination of hardware
circuitry and software. Accordingly, the present invention may take the form of an
entirely hardware embodiment, an entirely software embodiment or an embodiment
combining software and hardware aspects.

Furthermore, the present invention may take the form of a data carrier medium
(e.g. a computer program product on a computer-readable storage medium) carrying
computer-readable program code segments embodied in the medium. The terms
"carrier medium" and "computer-readable medium" as used herein refer to any
medium that participates in providing instructions to a processor such as a CPU for
execution. Such a medium may take many forms, including but not limited to,
non-volatile media, volatile media, and transmission media. Non-volatile media include,
for example, optical or magnetic disks, such as a storage device which is part of mass
storage. Volatile media include dynamic memory such as RAM. Transmission media
include coaxial cables, copper wire and fiber optics, including the wires that comprise
a bus within a computer, such as the bus. Transmission media can also take the form of
acoustic or light waves, such as those generated during radio wave and infra-red data
communications.

Common forms of computer-readable media include, for example, a floppy
disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a
CD-ROM, any other optical medium, punch cards, paper tapes, any other physical medium
with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other
memory chip or cartridge, a carrier wave as described hereafter, or any other medium
from which a computer can read.

These various forms of computer readable media may be involved in carrying
one or more sequences of one or more instructions to a processor for execution. For
example, the instructions may initially be carried on a magnetic disk of a remote
computer. The remote computer can load the instructions into its dynamic memory and
send the instructions over a telephone line using a modem. A modem local to the
computer system can receive the data on the telephone line and use an infrared
transmitter to convert the data to an infra-red signal. An infra-red detector coupled to a
bus can receive the data carried in the infra-red signal and place the data on the bus.
The bus carries data to main memory, from which a processor retrieves and executes
the instructions. The instructions received by main memory may optionally be stored
on a storage device either before or after execution by a processor. The instructions can
also be transmitted via a carrier wave in a network, such as a LAN, a WAN, or the
Internet.

However, it is important to note that those skilled in the art will appreciate that the methods of the
present invention are capable of being distributed as a program product in a variety of
forms, and that the present invention applies equally regardless of the particular type of
signal bearing media used to actually carry out the distribution. Examples of computer
readable signal bearing media include: recordable type media such as floppy disks and
CD ROMs and transmission type media such as digital and analogue communication
links which may be used for downloading the computer program product.

With reference to Figs. 1 and 2 and the above description, the modular
approach of the present invention will be described. This modular approach allows data
models, data sub-models, regression models, regression sub-models and decision
modules to be located on one computing device or distributed through a distributed
network as described with reference to Figs. 1 and 2. Further, the various modules of
the present invention may be implemented on a parallel processor architecture or on a
multi-threaded single computer. With reference to the figures, the plurality of
independent data sets and data models may be located in a distributed fashion
throughout a distributed network.

2. A modular approach 

The present invention and its embodiments provide a modular approach to data
modeling and regression. In essence, the data models are generated independently of
the regression task. This is done with an optimization procedure that operates on
samples drawn from the input distribution (in $V$-space). This results in what is called
a data model. Then, for a given regression task and, thus, for a given set of
input/output pairs, only the regression weights $W_j$ and $W_{ij}$ are optimized so as to
minimize the regression error; the data model parameters are kept constant. This
second, regression model is specified by the regression application at hand. The data
and regression models are, thus, developed separately, and they operate in sequence
(vertical modularity, see Fig. 3A).

It is particularly preferred if the data model is developed in accordance with a
kernel-based topographic map formation procedure. As a result, the kernel centers $w_i$
and kernel radii $\sigma_i$ are obtained in such a manner that an equiprobabilistic map is
obtained: each kernel will be activated with equal probability and the map is a faithful
representation of the input distribution. Other unsupervised data model generators may
be used such as self-organizing maps (SOM), topographic maps, constrained
topological mapping (CTM), transformation methods such as principal component
analysis (PCA), independent component analysis (ICA), multidimensional scaling or
similar methods, especially unsupervised methods.

The regression models may be, for example, multilayer perceptrons, 
(smoothing) splines, support vector machines (SVM), Radial Basis Function (RBF) 
networks, general regression neural network (GRNN), constrained topological 
mapping (CTM), Generative Topographic Map (GTM) and the mixture of experts 
network or similar.

The present embodiment provides the following advantages: 

• the regression surface will be locally more detailed when locally more samples are
available and vice versa

• several regression applications can be grafted onto the same data model (Fig. 3B)

• the set of input samples can be split into subsets for which separate regression
modules can be developed. The ensemble of these modules then forms the regression
application (horizontal modularity, see Fig. 8A)

• the input space can be split into subspaces for which regression modules can be
developed that together form the regression application (horizontal modularity, see Fig.
8B)

• the input samples $x^\mu$ may be incomplete, i.e., some of the vector components may
be missing (incomplete data handling)

Hence the present embodiment is modular in two ways: vertically modular 
(data model and regression model) and horizontally modular (data modules and 
regression modules). 






3.0 Vertical modularity embodiment 
3.1.0 Data model 

The data model consists of a lattice $A$, with a regular and fixed topology, of
arbitrary dimensionality $d_A$, in $d$-dimensional input space $V \subseteq \mathbb{R}^d$. To each of the $N$
nodes of the lattice corresponds a formal neuron which possesses, in addition to the
traditional weight vector $w_i$, a hypervolume activation region which may be circular or
hyperspherical, hyper-oval or similar in shape. Such an activation region $S_i$ has a
radius $\sigma_i$ in $V$-space (Fig. 4A). The neural activation state is represented by the code
membership function:

$$\mathbb{1}_i(x) = \begin{cases} 1 & \text{if } x \in S_i \\ 0 & \text{if } x \notin S_i \end{cases} \tag{10}$$

with $x \in V$. Depending on $\mathbb{1}_i$, the weights $w_i$ are adapted so as to produce a
topology-preserving mapping: neighboring neurons in the lattice code for neighboring positions
in $V$-space. The radii $\sigma_i$ are adapted so as to produce a lattice of which the neurons
have an equal probability to be active (equiprobabilistic map), i.e.,

$$P(\mathbb{1}_i(x) = 1) = \frac{\rho}{N}, \quad \forall i,$$

with $\rho$ a scale factor.
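
As an illustration only, the code membership function and the equiprobability condition can be sketched for hyperspherical activation regions. The function names `code_membership` and `activation_rates` are hypothetical, and treating points on the boundary of a region as inside it is an assumption of this sketch.

```python
import math


def code_membership(x, w, sigma):
    """Return 1 if x lies inside the hypersphere of radius sigma centred on w, else 0.

    Assumption: points exactly on the boundary count as inside.
    """
    return 1 if math.dist(x, w) <= sigma else 0


def activation_rates(samples, weights, radii):
    """Fraction of the samples that activates each neuron."""
    counts = [0] * len(weights)
    for x in samples:
        for i, (w, s) in enumerate(zip(weights, radii)):
            counts[i] += code_membership(x, w, s)
    return [c / len(samples) for c in counts]
```

For an equiprobabilistic map, the rates returned by `activation_rates` would all be close to the same value, the scale factor divided by the number of neurons.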



3.1.1. Training the data model 

The data model is trained as follows. A training set $M = \{x^\mu\}$ of $M$ input
samples is considered. In batch mode, the kernel-based Maximum Entropy learning
Rule (kMER) updates the neuron weights $w_i$ and radii $\sigma_i$ as follows (Van Hulle, 1998,
1999b):

$$\Delta w_i = \eta \sum_{\mu} \sum_{j \in A} \Lambda(i, j, \sigma_\Lambda)\, \Xi_i(x^\mu)\, \mathrm{Sgn}(x^\mu - w_i), \quad \forall i, \tag{11}$$

$$\Delta \sigma_i = \eta \sum_{\mu} \left( \frac{\rho_r}{N} \left( 1 - \mathbb{1}_i(x^\mu) \right) - \mathbb{1}_i(x^\mu) \right), \quad \forall i, \tag{12}$$

with $\rho_r = \frac{\rho N}{N - \rho}$, $\eta$ the learning rate, and $\Lambda(\cdot)$ the usual neighborhood function, e.g., a Gaussian, of which
the range $\sigma_\Lambda$ is gradually decreased during learning ("cooling"), and $\Xi_i$ the fuzzy code
membership function of neuron $i$:

$$\Xi_i(x) = \frac{\mathbb{1}_i(x)}{\sum_{k \in A} \mathbb{1}_k(x)}, \tag{13}$$

so that $0 \le \Xi_i(x) \le 1$ and $\sum_i \Xi_i(x) = 1$. $\rho > 1$ is used so that the activation regions
overlap. Finally, the previous batch learning scheme is also available for incremental
learning. The skilled person in the art of neural network learning is aware of batch or
incremental learning techniques.
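
A minimal sketch of the batch radius update may help to see why the rule leads to equal activation probabilities. It assumes that the radius rule has the form $\Delta\sigma_i = \eta \sum_\mu \left( \frac{\rho_r}{N}(1 - \mathbb{1}_i(x^\mu)) - \mathbb{1}_i(x^\mu) \right)$ with $\rho_r = \rho N / (N - \rho)$, and that the fuzzy membership is the binary activation divided by the number of active neurons. The function names and the argument layout are hypothetical, and the weight update with its neighborhood function is deliberately left out.

```python
def fuzzy_membership(active):
    """Normalise binary activations so they sum to 1 (all zeros if none is active)."""
    total = sum(active)
    return [a / total if total else 0.0 for a in active]


def radius_updates(activations, rho, eta):
    """Batch radius increments under the assumed rule.

    activations[mu][i] is the 0/1 code membership of neuron i for sample mu.
    A neuron that is active too often shrinks; one that is rarely active grows.
    """
    n = len(activations[0])
    rho_r = rho * n / (n - rho)
    deltas = [0.0] * n
    for row in activations:
        for i, a in enumerate(row):
            deltas[i] += eta * (rho_r / n * (1 - a) - a)
    return deltas
```

Under these assumptions the summed increment of a neuron vanishes when it is active for a fraction $\rho/N$ of the samples, which is the equiprobability condition.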

The characteristics of the trained data model can be summarized as follows:

1. the lattice forms a discrete estimate to a possibly non-linear data manifold in the
input space $V$

2. the lattice defines a topology-preserving mapping from input to lattice space:
neighboring neurons in the lattice code for neighboring positions in $V$-space

3. more neurons will be allocated at high density regions in $V$-space ("faithful" map,
Van Hulle, 1999b).

3.1.2. Incomplete data handling 

The input data $x = [x_j]$ used for training could be incomplete: for some vector
components there is no value available. These vector components will be treated as
"don't cares", $x_j = \times$. In particular, the active neurons are determined and the weights
and radii are updated only by using the input vector components for which a value is
available. For example, an adapted kMER algorithm is given in Fig. 5, in batch mode,
but it can equally well be shown for incremental learning.

Filling-in missing entries

Once the data model is developed, the missing entries can be filled in: by
applying the incomplete input vector to the map, the neuron with the closest weight
vector can be determined, ignoring the missing vector components, and the latter
can be replaced by those of the closest weight vector. The algorithm is shown in Fig. 6.
Instead of using only the closest weight vector, all active neurons could be selected,
and their weight vectors could be used for deriving estimates for the missing input
vector components. One way to do this is by determining the average of the weight
vectors, and then performing the substitution using this averaged vector. The neuron radii
could even be invoked to determine the average. For example, if vector component $x_j$ is
missing, then the following can be done:

[equation illegible in the source]

defined over the set of the active neurons, with $d$ the dimensionality of the input space. Still
another way is to do the following:

[equation illegible in the source]

with $p(w_i)$ the density (estimate) at position $w_i$ in $V$-space (for the use of density
estimates, see below).
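
The simplest fill-in variant, which uses only the closest weight vector, can be sketched as follows. This is not the algorithm of Fig. 6 itself, which is not reproduced here; the function name `fill_missing`, the use of `None` for a missing component, and the Euclidean distance over the known components are assumptions of this sketch.

```python
import math


def fill_missing(x, weights):
    """Replace None entries of x by the matching entries of the closest weight vector.

    The distance to each weight vector is computed only over the components
    of x for which a value is available.
    """
    known = [j for j, v in enumerate(x) if v is not None]

    def partial_distance(w):
        return math.sqrt(sum((x[j] - w[j]) ** 2 for j in known))

    closest = min(weights, key=partial_distance)
    return [closest[j] if v is None else v for j, v in enumerate(x)]
```

For instance, with weight vectors `[0, 0, 0]` and `[1, 1, 5]`, the incomplete input `[0.9, None, None]` is closest to the second vector on its only known component, so its missing entries are taken from that vector.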






3.2 Regression application 

When the topographic map $A$ is developed, thus, without using information
about the desired outputs, the circular (spherical, hyperspherical) activation region of
each neuron can be supplemented with a kernel, e.g., a normalized Gaussian:

$$g_i(x, \rho_s) = \frac{\exp\left( -\dfrac{\| x - w_i \|^2}{2 (\rho_s \sigma_i)^2} \right)}{\sum_{k=1}^{N} \exp\left( -\dfrac{\| x - w_k \|^2}{2 (\rho_s \sigma_k)^2} \right)}, \quad \forall i \in A, \tag{16}$$

with $\rho_s$ a smoothing factor. These kernels are then used as the neurons' output
functions. The output of the regression model for a given input $x$ can be written as
follows:

$$O(x) = \sum_{j=1}^{N} W_j\, g_j(x, \rho_s) \tag{17}$$

in the univariate (scalar) case, and

$$O_i(x) = \sum_{j=1}^{N} W_{ij}\, g_j(x, \rho_s), \quad i = 1, \ldots, d_y, \tag{18}$$

in the multivariate (vectorial) case.

Hence, there are two types of unknown variables in the regression model: the
weights $W_j$, $\forall j$, or $W_{ij}$, $\forall i, j$, in the univariate or the multivariate case, respectively, and
the smoothing factor $\rho_s$.
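
The univariate regression output can be sketched as a weighted sum of normalized Gaussian kernels. The kernel width used below, the product of the smoothing factor and the neuron radius, is an assumption about the exact form of the kernel, as are the function names; returning all-zero kernel outputs when every kernel underflows is a further assumption of this sketch.

```python
import math


def normalized_kernels(x, centers, radii, rho_s):
    """Normalised Gaussian kernel outputs for input x; they sum to 1.

    Assumption: the width of kernel i is rho_s * radii[i].
    """
    raw = [
        math.exp(-math.dist(x, c) ** 2 / (2.0 * (rho_s * r) ** 2))
        for c, r in zip(centers, radii)
    ]
    total = sum(raw)
    if total == 0.0:
        # Far away from every kernel all exponentials underflow to zero.
        return [0.0] * len(raw)
    return [v / total for v in raw]


def regression_output(x, centers, radii, rho_s, W):
    """Scalar regression output: weighted sum of the kernel outputs."""
    g = normalized_kernels(x, centers, radii, rho_s)
    return sum(Wj * gj for Wj, gj in zip(W, g))
```

An input halfway between two kernels of equal radius activates both equally, so the output is the average of their two regression weights.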



3.2.1. Training the regression application

Several types of algorithms are included within the scope of the present
invention. In the following the simplest one is described, based on gradient descent,
but the present invention is not limited thereto. A set of input/output pairs is
considered, i.e., the training set: $M = \{(y^\mu, x^\mu) \mid \mu = 1, \ldots, M\}$. Training the weights $W_j$
is performed with the delta rule:

$$\Delta W_j = \eta \left( y^\mu - \sum_{k} W_k\, g_k(x^\mu, \rho_s) \right) g_j(x^\mu, \rho_s), \tag{19}$$

in the case of univariate regression. The rule is readily extendible to the multivariate
case by developing one series of weights for each output vector component. The
following training set is considered: $M = \{(\mathbf{y}^\mu, x^\mu) \mid \mu = 1, \ldots, M\}$, with $\mathbf{y}^\mu = [y_i^\mu]$.
The delta rule now becomes:

$$\Delta W_{ij} = \eta \left( y_i^\mu - \sum_{k} W_{ik}\, g_k(x^\mu, \rho_s) \right) g_j(x^\mu, \rho_s). \tag{20}$$
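
The univariate delta rule can be sketched as below. Only the output weights are adapted while the kernel centers and radii of the data model stay fixed. The normalized Gaussian kernel with width equal to the smoothing factor times the radius, the learning rate, the number of epochs and all function names are assumptions made for illustration.

```python
import math


def kernels(x, centers, radii, rho_s):
    """Normalised Gaussian kernel outputs (assumed width: rho_s * radius)."""
    raw = [
        math.exp(-math.dist(x, c) ** 2 / (2.0 * (rho_s * r) ** 2))
        for c, r in zip(centers, radii)
    ]
    total = sum(raw)
    return [v / total for v in raw]


def train_delta_rule(samples, centers, radii, rho_s, eta=0.5, epochs=200):
    """Incremental delta rule on the output weights; the data model is not changed.

    samples is a list of (x, y) pairs with x a tuple and y a scalar.
    """
    W = [0.0] * len(centers)
    for _ in range(epochs):
        for x, y in samples:
            g = kernels(x, centers, radii, rho_s)
            error = y - sum(Wj * gj for Wj, gj in zip(W, g))
            W = [Wj + eta * error * gj for Wj, gj in zip(W, g)]
    return W
```

Because only one layer of weights is trained and the model is linear in those weights, each update is a plain least-mean-squares step.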
Projection pursuit regression 

Another embodiment of the present invention applies projection pursuit
regression (PPR) (Friedman, J.H., and Stuetzle, W. (1981) Projection pursuit
regression. J. Am. Statist. Assoc. 76 (376), 817-823) to kernel-based regression with
topographic maps. In PPR, the $d$-dimensional data points are interpreted through
optimally-chosen lower dimensional projections; the "pursuit" part refers to an
optimization with respect to these projection directions. For simplicity, the case will be
considered where the function $f$ is approximated by a sum of scalar regression models

$$O(x^\mu) = \sum_{k} O_k(a_k^T x^\mu), \tag{21}$$

with $a_k$ the projection directions (unit vectors) and where $T$ stands for transpose. Each
output $O_k$ refers to a regression model of the type given in eq. (18). The parameters of $O_k$
and projections $a_k$ are estimated sequentially in order to minimize the mean squared
error (MSE) of the residuals:

$$C(a_k) = \frac{1}{M} \sum_{\mu=1}^{M} \left( O_k(a_k^T x^\mu) - \left\{ y^\mu - \sum_{i<k} O_i(a_i^T x^\mu) \right\} \right)^2, \tag{22}$$

where the term between the curly brackets denotes the $k$th residual of the $\mu$th data
point, with $r_0^\mu = y^\mu$. In order to further optimize the projection directions,
backfitting can be applied: $C(a_k)$ is cyclically minimized for the residuals of projection
direction $k$ until there is little or no change.
Determining the smoothing factor

Training for the smoothing factor $\rho_s$ can also be done with the delta rule;
however, better results (i.e., more reliable than would be achieved by learning) can be
obtained in the following manner. The following quantity is defined as the training
error:

$$TE = \frac{1}{M} \sum_{\mu=1}^{M} \left\| y^\mu - O(x^\mu) \right\|, \tag{23}$$

with $O = [O_i]$, with $O_i$ as in eq. (18). In a modification of this method a test set rather
than a training set is used, and the test error is determined rather than the training error.
The decision depends on how many training samples are being used, and how many
are available for testing.

TE is plotted as a function of $\rho_s$ and the $\rho_s$-value that minimizes TE is sought.
A clear parabolic (TE, $\rho_s$) plot is expected. Only three points are then necessary to
locate the minimum, theoretically, but the effect of numerical errors must be
considered.

For example, the following strategy can be adopted for determining three TEs
(assuming that $\rho_r > 1$ in kMER so that the activation regions overlap):

1. starting with $\rho_s = 1$, the regression model is trained and the first training error, $TE^{1}$,
is determined;

2. $\rho_s$ is increased, e.g., to 1.1, the regression model is trained again and the second
training error, $TE^{2}$, is determined;

3. $\rho_s$ is decreased, e.g., to 0.9, the regression model is trained again and the third
training error, $TE^{3}$, is determined.

From the relative magnitudes of $TE^{1}$, $TE^{2}$ and $TE^{3}$, the location of the minimum can be
estimated, or the direction in which the minimum should be sought can be determined.
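
The three-point strategy can be sketched as follows. This is only an illustration under the stated assumption that TE is approximately parabolic in the smoothing factor: the function names `parabola_minimum` and `search_direction`, and the numerical TE values, are hypothetical and are not taken from the embodiment.

```python
def parabola_minimum(points):
    """Abscissa of the vertex of the parabola through three (rho, TE) points.

    Returns None when the fitted parabola has no minimum (concave or flat).
    """
    (x1, y1), (x2, y2), (x3, y3) = points
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    if denom == 0:
        raise ValueError("the three smoothing factor values must be distinct")
    # Coefficients of TE = a*rho^2 + b*rho + c through the three points.
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3 ** 2 * (y1 - y2) + x2 ** 2 * (y3 - y1) + x1 ** 2 * (y2 - y3)) / denom
    if a <= 0:
        return None
    return -b / (2.0 * a)


def search_direction(points):
    """+1 if TE is lowest at the largest rho, -1 if lowest at the smallest, else 0."""
    errors = [te for _, te in sorted(points)]
    lowest = errors.index(min(errors))
    return {0: -1, len(errors) - 1: +1}.get(lowest, 0)


# Illustrative (made-up) training errors measured at rho = 1.0, 1.1 and 0.9.
samples = [(1.0, 0.0025), (1.1, 0.0025), (0.9, 0.0225)]
rho_min = parabola_minimum(samples)
```

Because numerical errors can distort the three TE values, the estimate would normally be checked by retraining at the estimated minimum; when no interior minimum exists, `search_direction` indicates on which side further values should be tried.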

3.2.2. Summary.

The characteristics of the present regression embodiment can be summarized
as follows:

1. the data model can be developed separately from the regression model: the regression
model is simply an add-on to the data model;

2. training the regression model will be fast since only one stage ("layer") of
connection weights needs to be trained;

3. the regression procedure adapts itself to the data (self-organization): since with
kMER the weight density will be proportional to the input density, the regression
model will be optimal in the sense that it will be locally more detailed when there are
locally more input data available (higher density in x-space), and vice versa;

4. by virtue of the delta rule and the smoothness procedure, regression modeling is
easily implementable.

4.0 Horizontal modularity embodiment

An important feature of the kernel-based approach is that the regression model
does not extrapolate beyond the range of its outermost kernel (which can be easily
detected): a zero output will be obtained. This makes it possible to develop a
horizontally modular approach to regression and, thus, to exploit parallelism. Basically,
there are two ways to look at this type of modularity: the data can be partitioned into
subsets of the input space data set or into subspace data sets (Fig. 8A, B). Evidently,
mixtures of the two can also be considered (Fig. 8C).






4.1. Subsets of data - single regression module 

Assume m subsets of input/output pairs are available (see "Training subsets
input vectors" and "Training subsets desired output scalars" in Fig. 9). Each subset of
input vectors is used for developing a corresponding data model (data model 1, ..., m).
The weights $\mathbf{w}_i$ and radii $\sigma_i$ available in the m data models are then used for
introducing kernels $g_i$ in the regression module. The denominator in the definition of the
normalized Gaussians refers to all kernels of all data models. Finally, the $W_i$ and $\rho_s$ are
determined, across the m data models, according to the scalar regression task.
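
A minimal sketch of this pooling step is given below. Eq. (18) is not repeated here, so the sketch assumes the common form in which the output is a weighted sum of normalized Gaussian kernels whose radii are scaled by the smoothing factor; all function names, the toy data models and the learning settings are hypothetical.

```python
import math


def normalized_kernels(x, centers, radii, rho_s=1.0):
    """Normalized Gaussian activations; the denominator runs over ALL pooled kernels."""
    acts = []
    for w, sigma in zip(centers, radii):
        d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
        acts.append(math.exp(-d2 / (2.0 * (rho_s * sigma) ** 2)))
    total = sum(acts)
    if total == 0.0:  # guard against division by zero far away from every kernel
        return [0.0] * len(acts)
    return [a / total for a in acts]


def predict(x, weights, centers, radii, rho_s=1.0):
    g = normalized_kernels(x, centers, radii, rho_s)
    return sum(W * gi for W, gi in zip(weights, g))


def train_delta_rule(samples, centers, radii, rho_s=1.0, eta=0.5, epochs=200):
    """Single layer of output weights trained with the delta (LMS) rule."""
    weights = [0.0] * len(centers)
    for _ in range(epochs):
        for x, y in samples:
            g = normalized_kernels(x, centers, radii, rho_s)
            err = y - sum(W * gi for W, gi in zip(weights, g))
            weights = [W + eta * err * gi for W, gi in zip(weights, g)]
    return weights


def mse(samples, weights, centers, radii, rho_s=1.0):
    return sum((y - predict(x, weights, centers, radii, rho_s)) ** 2
               for x, y in samples) / len(samples)


# Two toy data models (kernel centers and radii), e.g. developed on two data subsets.
data_models = [
    {"centers": [[0.0], [0.5]], "radii": [0.3, 0.3]},
    {"centers": [[1.0], [1.5]], "radii": [0.3, 0.3]},
]
centers = [w for dm in data_models for w in dm["centers"]]
radii = [s for dm in data_models for s in dm["radii"]]

samples = [([0.25 * n], 0.5 * n) for n in range(7)]  # y = 2x on [0, 1.5]
mse_before = mse(samples, [0.0] * len(centers), centers, radii)
weights = train_delta_rule(samples, centers, radii)
mse_after = mse(samples, weights, centers, radii)
```

Only the pooled kernel parameters are needed by the regression module, which is why the data models can be developed independently and, if desired, on different machines.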



4.1.1. Vectorial regression task

The former can be extended to a vectorial regression task, and desired output
vectors can be considered instead of scalars. $d_y$ regression models are developed, one
for each output vector component. This can be done in parallel.

4.2. Subsets of data - multiple regression modules 

The previous regression modeling strategy can be extended by breaking up the
single regression module into m + 1 regression modules at two levels (Fig. 10): at the
first level, there are m regression modules, one for each subset of input/output pairs,
and at the second level, there is one module that integrates the outputs of the first level
modules in order to produce the final regression output O. The m level 1 regression
modules are trained as before. For the level 2 module, the following weighted sum of
the outputs of the level 1 regression modules is taken:

$$O(\mathbf{x}) = \sum_{j=1}^{m} \mathit{WO}_j \, O_j(\mathbf{x}, \rho_s). \qquad (24)$$

Again, the $\mathit{WO}_j$ are parameters that can be trained in the same way as explained before,
or in a similar way. If the subsets are perfectly disjunct, then one can simply take the
sum of all $O_j$.

Bayesian approach

Another embodiment that does not require additional training will now be
introduced: it takes advantage of the fact that the regression surface developed in
each module will be more detailed when more samples are available. Hence, if a density
estimate is available in each module, then there will be more certainty about the
regression output produced by a given module if the corresponding input also
corresponds to a high local density.

Rather than having a common second level regression module as in Fig. 10,
which requires global learning, a common decision module is introduced, which does
not require additional training (Fig. 11). It goes as follows. Let $p_j(\mathbf{x} \mid j)$ be the
(conditional or component) density estimate of the jth module for input $\mathbf{x}$. The
regression output produced by the module $i^*$ for which the product of the conditional
density and the prior probability is the largest is taken:

$$i^* = \arg\max_j \left( p_j(\mathbf{x} \mid j)\, P(j) \right), \qquad (25)$$

with $P(j)$ the (prior) probability of the jth module. Unless prior knowledge is
available, all $P(j)$ can be assumed to be equal to 1/m, hence:

$$i^* = \arg\max_j \; p_j(\mathbf{x} \mid j). \qquad (26)$$

For both cases, the regression output O becomes:

$$O(\mathbf{x}) = O_{i^*}(\mathbf{x}). \qquad (27)$$

In other words, the regression output is taken for which the largest number of local
input/output pairs were available for training. This selection criterion is reminiscent of
a Bayesian pattern classification procedure (whence the procedure's name). The
selection itself occurs in the decision module. How an estimate of the probability density of
each module is obtained from the data modules is explained in the articles Van Hulle,
M.M. (1998), Kernel-based equiprobabilistic topographic map formation, Neural
Computat., 10(7), 1847-1871; Van Hulle, M.M. (1999a), Hill-climbing, density-based
clustering and equiprobabilistic topographic maps, Internal Report Synes; Van Hulle,
M.M. (1999b), Faithful representations with topographic maps, Neural Networks, 12,
803-823; in the book "Faithful representations and topographic maps", Marc Van
Hulle, John Wiley & Sons, 2000; and in co-pending International Patent Application
PCT/BE00/00099, which is incorporated herein by reference.
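
The decision module described above can be sketched as follows: the module whose density-times-prior product is largest supplies the regression output. The Gaussian density functions and the two toy regression functions are hypothetical placeholders for the density estimates and regression modules obtained from the data modules.

```python
import math


def select_module(densities, priors=None):
    """Index of the module with the largest p(x | j) * P(j); equal priors by default."""
    m = len(densities)
    if priors is None:
        priors = [1.0 / m] * m
    scores = [p * P for p, P in zip(densities, priors)]
    return max(range(m), key=lambda j: scores[j])


def decision_module(x, modules, priors=None):
    """modules: list of (density_fn, regression_fn) pairs, one per level 1 module."""
    densities = [density(x) for density, _ in modules]
    best = select_module(densities, priors)
    return modules[best][1](x)


def gaussian_density(center, sigma):
    """Hypothetical 1-D density estimate standing in for a module's density model."""
    def density(x):
        return math.exp(-(x - center) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))
    return density


modules = [
    (gaussian_density(0.0, 1.0), lambda x: 2.0 * x),  # module trained on data near 0
    (gaussian_density(5.0, 1.0), lambda x: -x),       # module trained on data near 5
]
output = decision_module(4.0, modules)  # x = 4 lies in the dense region of module 2
```

No parameter is learned in this step: only one density value and one regression output per module are needed for each query.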

An alternative is to interpolate between the regression outputs produced by all 
modules according to their density estimates. The regression application's output O can 
then be written as follows: 

Bayesian approach - Risk minimization

Maximizing the product of the conditional density and the prior probability
might not be appropriate for some applications, for example, when the consequence of
erroneously taking the output of module i is much more serious than that of module j. In
order to take such effects into account, one can define a loss matrix $L = [L_{ij}]$ of which
each element $L_{ij}$ is the loss incurred by erroneously taking module j when actually
module i should have been taken. The risk is then minimized when:

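$$\sum_{k=1}^{m} L_{k i^*}\, p(\mathbf{x} \mid k)\, P(k) < \sum_{k=1}^{m} L_{kj}\, p(\mathbf{x} \mid k)\, P(k), \qquad \forall j \neq i^*. \qquad (29)$$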

The $L_{ij}$ can be based on experience or expressed in a more quantitative manner, for
example, in monetary terms.
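
As an illustration, the sketch below assumes that the risk criterion is the usual minimum expected loss rule: for each candidate module the losses are weighted by the density-times-prior products of the modules that should have been taken, and the candidate with the smallest total is selected. The function name and the numbers are hypothetical.

```python
def min_risk_module(densities, priors, loss):
    """Return (index, risks) of the module with the smallest expected loss.

    loss[i][j] is the loss incurred by taking module j when module i
    should have been taken.
    """
    m = len(densities)
    joint = [p * P for p, P in zip(densities, priors)]
    risks = [sum(loss[i][j] * joint[i] for i in range(m)) for j in range(m)]
    return min(range(m), key=lambda j: risks[j]), risks


densities = [0.6, 0.4]  # illustrative density estimates p(x | j) for two modules
priors = [0.5, 0.5]

# With a symmetric 0/1 loss the choice coincides with the largest density-prior product.
choice_01, _ = min_risk_module(densities, priors, [[0.0, 1.0], [1.0, 0.0]])

# Wrongly taking module 1 (index 0) is made five times as costly: the choice flips.
choice_asym, risks_asym = min_risk_module(densities, priors, [[0.0, 1.0], [5.0, 0.0]])
```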
Advantages

The advantages of the present embodiment can be summarized as follows:

1. No additional training is required, a feature which is particularly important when
dealing with regression modules that are developed on separate computers (data
servers): in this way, no global communication between the data servers is needed in
order to build a distributed regression application. In the more traditional procedure, a
global communication of the training samples would be required.

2. A new set of data that one wishes to consider in the regression application can be
included without retraining the whole application: the new data set can be regarded as
a subset for which a separate data module and regression module are developed. This
can be done in parallel, that is, while the original regression application is in use.

3. Each regression module can be supplemented with a risk factor so that risk
minimization can be performed when considering the outputs of the regression
modules.

4.2.1. Vectorial regression task

The previous embodiment can be extended to a vectorial regression
application, thus with desired $d_y$-dimensional output vectors, rather than scalars. $d_y$
regression models are developed in parallel, one for each output vector component. All
training processes can run in parallel.

4.3. Subspaces - multiple regression modules
4.3.1. Scalar regression task

The complementary problem is now considered: the data set is partitioned
along m disjunct subspaces. The focus is first on the input vectors. The input space
of the data set is partitioned along m subspaces of which the dimensionalities are $d_i$,
$0 < d_i < d$, $\sum_i d_i = d$. For example, if input sample $\mathbf{x} = (x_1, \ldots, x_d)$ is considered, then
data module 1 is developed using the subspace vector $(x_1, \ldots, x_{d_1})$, data module 2
using $(x_{d_1+1}, \ldots, x_{d_1+d_2})$, and finally, data module m using $(x_{d-d_m+1}, \ldots, x_d)$.

In the event that the input data for these subspaces are used for developing the
data modules on separate data servers, then this modeling paradigm cannot be
addressed without relying on global communication. The present embodiment
addresses how much communication is needed, and what the nature is of the
information that is communicated.

Since the data modules are developed separately from the regression
application, and since kMER is used for it, an elegant solution can be found that
minimizes the need for global communication. This results in a new procedure,
compatible with the one introduced in the previous subsection. For the sake of
exposition, it is assumed that the subspace data sets are stored on separate data
servers. However, it is clear that they can be grouped onto one or more servers. The
overall scheme is shown in Fig. 12.

For each data server, a data module is developed with kMER using the
subspace vectors that are locally available. It is assumed that the desired output values
(scalars) are available on each data server. m (level 1) regression modules are then
locally developed, as explained in section 3.2.1, thus using the subspace vectors as
input vectors. The outputs of the m regression modules still have to be integrated into
one global regression result. This is done by applying backfitting (see section 3.2.1). In
order to generate an initial regression estimate, from which backfitting can start, the
outputs produced by each of the m regression modules are averaged:

$$O(\mathbf{x}^{\mu}) = \hat{y}^{\mu} = \sum_{k=1}^{m} \frac{1}{m}\, O_k(\mathbf{a}_k \mathbf{x}^{\mu}), \qquad (30)$$

with $\mathbf{a}_k$ a vector, the components of which are 1 when the corresponding input
dimension is coded for by the kth regression model, and 0 otherwise. Note the resemblance
with projection pursuit regression (PPR), except that here the $\mathbf{a}_k$ are not subject to
optimization.
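
The role of the indicator vectors and of the averaging step can be sketched as follows. The sketch assumes consecutive, disjunct subspaces; the helper names and the two toy module functions are hypothetical.

```python
def indicator_vectors(dims):
    """Indicator vectors for consecutive disjunct subspaces of sizes d_1, ..., d_m."""
    d = sum(dims)
    vectors, start = [], 0
    for d_k in dims:
        vectors.append([1 if start <= j < start + d_k else 0 for j in range(d)])
        start += d_k
    return vectors


def subspace(x, a_k):
    """Components of x that are coded for by the module with indicator vector a_k."""
    return [xj for xj, aj in zip(x, a_k) if aj == 1]


def initial_estimate(x, modules, indicators):
    """Average of the m module outputs, each module seeing only its own subspace."""
    outputs = [O_k(subspace(x, a_k)) for O_k, a_k in zip(modules, indicators)]
    return sum(outputs) / len(outputs)


a = indicator_vectors([2, 1])      # d = 3, split into subspaces of size 2 and 1
modules = [
    lambda s: s[0] + s[1],         # toy level 1 module on (x_1, x_2)
    lambda s: 10.0 * s[0],         # toy level 1 module on (x_3)
]
estimate = initial_estimate([1.0, 2.0, 3.0], modules, a)
```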

Next, the following algorithm is considered, which is applied to a training set
that is allowed to be different from the one used for training the m regression modules.

Backfitting initialization

For each module j, the module's regression output $O_j(\mathbf{a}_j \mathbf{x}^{\mu})$ is determined for
each input vector of the training set. These outputs (i.e., scalars) are then
communicated to all other data servers. Hence, in this way, each data server has at its
disposal the regression outputs produced by all regression modules for all the input
vectors used for training. A module index, i, is introduced and is set to i = 1. Let ε be a
preset level of training error. The total regression error (TRE) is defined as follows:

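$$TRE = \frac{1}{M}\sum_{\mu=1}^{M}\left( y^{\mu} - \sum_{k=1}^{m}\frac{1}{m}\, O_k(\mathbf{a}_k \mathbf{x}^{\mu}) \right)^{2}, \qquad (31)$$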

which is preferably determined for a separate test set, rather than the training set. The
TRE is communicated to regression module 1.

do until TRE < ε or until maximum number of iterations is reached
    for (i <= 1; i ≤ m; i <= i + 1)

        backfitting run i

        Module i's regression parameters are adapted so as to reduce the total regression
        error (gradient descent step, in batch mode); the parameters of all other regression
        modules are kept constant. The regression outputs of module i are determined for
        each subspace input vector of the training set and communicated to the next module
        on which a learning step is going to be performed, module i + 1. The module index is
        incremented, and another backfitting run is performed.



In summary, a continued cyclical update of the modules is carried out until the total
regression error is small enough or a more elaborate stop criterion has been satisfied
(e.g., either TRE < ε or the maximum number of iterations has been reached).

It should be noted that only a restricted amount of information is exchanged
between the data servers, and only when needed during learning. Furthermore, only
model outputs are exchanged and not input data, which would require more
communication effort and could, e.g., raise concerns about ownership and
confidentiality of the data.
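
The cyclic update can be sketched as below. The sketch makes several assumptions that are not part of the embodiment: each level 1 module is replaced by a hypothetical linear model on its subspace (class `LinearModule`), the total regression error is taken as the mean squared difference between the desired output and the averaged module outputs, and the data, learning rate and stop values are made up. What it preserves is the communication pattern: each module keeps its inputs local and only lists of output scalars are exchanged.

```python
class LinearModule:
    """Hypothetical stand-in for a level 1 regression module on one data server."""

    def __init__(self, inputs):
        self.inputs = inputs                 # subspace vectors, never communicated
        self.w = [0.0] * len(inputs[0])
        self.b = 0.0

    def outputs(self):
        return [sum(wj * sj for wj, sj in zip(self.w, s)) + self.b for s in self.inputs]

    def gradient_step(self, residuals, m, eta):
        """Batch gradient descent step on the total regression error."""
        M = len(residuals)
        scale = -2.0 / (M * m)
        grads = [scale * sum(r * s[j] for r, s in zip(residuals, self.inputs))
                 for j in range(len(self.w))]
        self.w = [wj - eta * g for wj, g in zip(self.w, grads)]
        self.b -= eta * scale * sum(residuals)


def averaged(outputs, mu):
    return sum(o[mu] for o in outputs) / len(outputs)


def total_regression_error(y, outputs):
    return sum((y[mu] - averaged(outputs, mu)) ** 2 for mu in range(len(y))) / len(y)


def backfit(modules, y, eps=1e-4, eta=0.5, max_iter=2000):
    m = len(modules)
    outputs = [mod.outputs() for mod in modules]   # initial exchange of outputs
    tre = total_regression_error(y, outputs)
    for _ in range(max_iter):
        if tre < eps:
            break
        for i, mod in enumerate(modules):          # one backfitting run per module
            residuals = [y[mu] - averaged(outputs, mu) for mu in range(len(y))]
            mod.gradient_step(residuals, m, eta)   # all other modules stay fixed
            outputs[i] = mod.outputs()             # only output scalars are passed on
        tre = total_regression_error(y, outputs)
    return tre, outputs


# Toy task y = x_1 + 2*x_2, with x_1 held by server 1 and x_2 by server 2.
server_1 = LinearModule([[0.0], [1.0], [0.0], [1.0], [2.0], [1.0]])
server_2 = LinearModule([[0.0], [0.0], [1.0], [1.0], [1.0], [2.0]])
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
final_tre, final_outputs = backfit([server_1, server_2], y)
```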

Finally, when backfitting has stopped, the regression modules can be used as
follows. First, the subspace input vectors are applied to each regression module, then
the corresponding regression module outputs are determined and communicated to
the level 2 regression module which, in turn, calculates the final regression result:

$$O(\mathbf{x}^{\mu}) = \sum_{k=1}^{m} \frac{1}{m}\, O_k(\mathbf{a}_k \mathbf{x}^{\mu}). \qquad (32)$$

4.3.2. Vectorial regression task

The previous embodiment can be easily extended to vectorial regression, thus
with $d_y$-dimensional desired output vectors, rather than scalars: $d_y$ regression models
are developed, one for each output vector component. All training processes can run in
parallel.

4.4. Subsets of subspaces - mixed case

The case where the data modules are trained on subsets that consist of
mixtures of input space vectors as well as subspace vectors is now considered, thus
with missing input vector components (see Fig. 8C). There are two possibilities.

If there are plenty of input space vectors available for developing the data
modules and/or the subspace vectors only have a limited number of missing vector
components, then the presence of the missing vector components can be ignored and
the strategy mentioned under section 4.2 can be applied: m regression models are
developed using the input space vectors as well as the subspace input vectors (Fig. 13).
For the subspace input vectors the data completion procedure explained in section
3.1.2 is used.

In the opposite case, if there are plenty of subspace vectors available, then the
strategy mentioned under section 4.3.1 should be considered, and the data modules
should be developed for each subspace separately, thereby ignoring the additional
vector components that some of the data vectors might have (Fig. 14).

The first strategy is clearly a heuristic one: its success will critically depend
on the ratio between the number of complete and incomplete input vectors, and on the
number of missing input vector components. The second strategy is in principle correct
but it requires much more data communication than the first.

Finally, there is also the general case where the data vector dimensionalities
vary in the training subsets. The subspace dimensionality that is used for each data
module might correspond to the set of vector components that are common to the
training subset or, if this is too extreme, a common set of vector components can be
chosen or determined; data completion can then be performed on the missing vector
components, and the components that do not belong to this common set can simply be
ignored. Evidently, such an approach is also a heuristic one.



4.4.1. Vectorial regression task

Again, a vectorial regression application can be implemented by developing $d_y$
regression models in parallel.



5. Utility

The systems and methods described above may be used to model a physical
system using a plurality of data models. A regression analysis may then be applied to
the modeled system using the plurality of data models. A physical parameter of the
modeled system may be changed, optimized, or controlled in accordance with the
output of a regression model determined in accordance with the present invention. The
output of the regression model is tangible, concrete and useful.

CLAIMS

1. A computer implemented regression method for carrying out a regression
application on a plurality of input data subsets of an input data signal, comprising the
steps of:

developing for each of the input data subsets a data model independently of the
regression application to form a set of data models;

generating a regression model for the input data signal using the set of data 
models. 



2. The method according to claim 1, wherein the developing step includes a map
developing step in which an unsupervised, self-organizing, kernel-based, topographic
map is developed from each input data subset by maximizing the entropy of the map's
output, the topographic map being executed by a neural network; and
during the development step receptive fields of the topographic map are truncated to
receptive field regions, the receptive field regions being independently scaleable in the
space of the input data signal.

3. The method according to claim 2, wherein the map is equiprobabilistic.

4. The method according to claim 2 or 3, wherein at least some of the receptive field
regions overlap.

5. The method according to claim 4, wherein the receptive fields are defined by the
same kernel function.



6. The method according to claim 5, wherein the receptive field regions are
hyper-spheroidal.

7. The method according to any of the claims 1 to 6, wherein each data model
comprises data points and a real function and the map developing step further
comprises estimating a data density for each data point of a data model and the
generating step includes the step of selecting a data point for the regression step based
on the data density for that point.

8. The method according to any of the claims 2 to 7, wherein the developing step
comprises adapting the center location and the extent of each receptive field region to
the input data density of the part of the input data signal data which activates the kernel
thereof.

9. The method according to claim 1, wherein the developing step includes use of any
of the following: self-organizing maps (SOM), topographic maps, constrained
topological mapping (CTM), transformation methods such as principal component
analysis (PCA), independent component analysis (ICA), multidimensional scaling or
similar methods, especially unsupervised methods.

10. The method according to any previous claim wherein the generating step includes
any of the following: multilayer perceptrons, (smoothing) splines, support vector
machines (SVM), Radial Basis Function (RBF) networks, general regression neural
network (GRNN), constrained topological mapping (CTM), Generative Topographic
Map (GTM), the mixture of experts network.

11. The method according to any previous claim, wherein each data record in the input
data signal is contained within one input data subset.

12. The method according to any of the claims 1 to 10, wherein each data record in the
input data signal is distributed over the input data subsets.

13. The method according to any of the claims 1 to 10 wherein some data records in
the input data signal are contained within one input data subset and some data records
are spread over some of the data subsets.

14. The method according to any previous claim wherein the step of generating the
regression model comprises generating a regression sub-model for each of the data
models, and combining the regression sub-models.

15. A computer based system for executing a regression application on a plurality of
input data subsets of an input data signal, comprising:

means for developing for each of the input data subsets a data model
independently of the regression application to form a set of data models; and

means for generating a regression model for the input data signal using the set
of data models.

16. The computer based system according to claim 15, wherein the means for
developing includes a neural network in which an unsupervised, self-organizing,
kernel-based, topographic map is developed from each input data subset by
maximizing the entropy of the map's output, and
means for truncating receptive fields of the topographic map to receptive field regions,
the receptive field regions being independently scaleable in the space of the input data
signal.

17. The computer based system of claim 15 or 16, wherein the means for generating the
regression model comprises means for generating a regression sub-model for each of
the data models, and means for combining the regression sub-models.

17. The computer based system of any of claims 14 to 16, further comprising means for
generating a data density model from each data model.

18. The computer based system according to claim 17, wherein each data model
comprises data points and a real function, further comprising decision means for
selecting which data point of each data model is used for the generation of the
regression model based on the output of the means for generating the data density
model for that data point.

19. A computer program product for causing a processing engine to execute any of the
methods according to claims 1 to 14.

20. A data carrier storing code segments of a computer program for executing any of
the methods according to claims 1 to 14.

Lattice_ij = Euclidean distance (neuron_i, neuron_j) in lattice coordinates
Missing values for input vector components v_j are marked as "x"

for (t <= 1; t ≤ t_max; t <= t + 1)

    /* Initialize Δw_ij and Δσ_i to zero */
    for (i <= 1; i ≤ N; i <= i + 1)
        for (j <= 1; j ≤ d; j <= j + 1)
            Δw_ij <= 0
        Δσ_i <= 0

    for (μ <= 1; μ ≤ M; μ <= μ + 1)

        /* Compute discrepancy (v, w) */
        /* Determine closest neuron and active neurons */
        for (i <= 1, count <= 0; i ≤ N; i <= i + 1)
            for (j <= 1, D2_i <= 0; j ≤ d; j <= j + 1)
                if (v_j ≠ x)    /* Not a missing entry */
                    D_ij <= v_j - w_ij
                    D2_i <= D2_i + D_ij²
            if (D2_i < D2_i*)
                i* <= i
            if (D2_i < σ_i²)
                CM_i <= 1
                count <= count + 1
            else
                CM_i <= 0

        /* The rest of the algorithm stays the same as in (Van Hulle, 1999a) and is
           given on the next page */

Lattice_ij = Euclidean distance (neuron_i, neuron_j) in lattice coordinates

for (t <= 1; t ≤ t_max; t <= t + 1)

    /* Initialize Δw_ij and Δσ_i to zero */
    for (i <= 1; i ≤ N; i <= i + 1)
        for (j <= 1; j ≤ d; j <= j + 1)
            Δw_ij <= 0
        Δσ_i <= 0

    σ_Λ <= σ_Λ0 · exp( ... )    /* exponent illegible */

    for (μ <= 1; μ ≤ M; μ <= μ + 1)

        select v from (distribution set)

        /* Compute discrepancy (v, w) and determine closest neuron and active neurons */
        for (i <= 1, D2_i* <= ∞, count <= 0; i ≤ N; i <= i + 1)
            for (j <= 1, D2_i <= 0; j ≤ d; j <= j + 1)
                D_ij <= v_j - w_ij
                D2_i <= D2_i + D_ij²
            if (D2_i < D2_i*)
                winner <= i
            if (D2_i < σ_i²)
                CM_i <= 1
                count <= count + 1
            else
                CM_i <= 0
        if (count == 0)    /* No neurons active */
            count <= 1
            CM_winner <= 1    /* Closest neuron made active */

        /* Compute partial update */
        for (i <= 1; i ≤ N; i <= i + 1)
            if (CM_i == 0)    /* Non-active neurons */
                D3 <= [ ... ]    /* expression in 1/count and exp(-Lattice² ...), illegible */
                for (j <= 1; j ≤ d; j <= j + 1)
                    Δw_ij <= Δw_ij + D3 · sign(D_ij)
                /* Adapt radius */
                Δσ_i <= Δσ_i + [ ... ] / N
            else    /* Active neurons */
                for (j <= 1; j ≤ d; j <= j + 1)
                    Δw_ij <= Δw_ij + sign(D_ij) / count
                Δσ_i <= Δσ_i - 1

    /* Batch update of weight vectors */
    for (i <= 1; i ≤ N; i <= i + 1)
        for (j <= 1; j ≤ d; j <= j + 1)
            w_ij <= w_ij + η · Δw_ij
        σ_i <= σ_i + η · Δσ_i

Fig. 5

Given: input vector v of which some components v_j are missing (marked "x")
Find: estimates for the missing vector components

/* Compute discrepancy (v, w) */
/* Determine closest neuron */
for (i <= 1, count <= 0; i ≤ N; i <= i + 1)
    for (j <= 1, D2_i <= 0; j ≤ d; j <= j + 1)
        if (v_j ≠ x)    /* Not a missing entry */
            D_ij <= v_j - w_ij
            D2_i <= D2_i + D_ij²
    if (D2_i < D2_i*)
        i* <= i

/* Fill in missing input vector components */
for (j <= 1; j ≤ d; j <= j + 1)
    if (v_j == x)    /* Missing entry */
        v_j <= w_i*j

Figure 6
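
The completion procedure of Figure 6 can be sketched as follows: the closest neuron is determined using only the components that are present, and the missing components are then copied from that neuron's weight vector. Using `None` as the missing-value marker (instead of "x"), the function name `complete` and the toy weight vectors are assumptions made for this sketch.

```python
def complete(v, weights):
    """Fill in missing components (None) of v from the closest neuron's weight vector."""
    best, best_d2 = None, float("inf")
    for i, w in enumerate(weights):
        # Squared distance computed over the non-missing components only.
        d2 = sum((vj - wj) ** 2 for vj, wj in zip(v, w) if vj is not None)
        if d2 < best_d2:
            best, best_d2 = i, d2
    winner = weights[best]
    return [wj if vj is None else vj for vj, wj in zip(v, winner)]


weights = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]  # toy neuron weight vectors
completed = complete([0.9, None, 1.2], weights)
```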